Regression Analysis and Feature Selection

先来复习点基础概念

RSS (Residual Sum of Squares)：残差平方和
$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
TSS (Total Sum of Squares)：总平方和
$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$
ESS (Explained Sum of Squares)：回归平方和
$ESS = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

==> $TSS = ESS + RSS$

MSE (Mean Squared Error)：均方误差
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Variance：方差
$Var(y) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2$

==> $R^2 = VE = 1 - \frac{RSS}{TSS} = 1 - \frac{MSE}{Var(Y)}$

线性回归

线性回归的假设

No Outliers: 没有离群值
No Multicollinearity: 没有多重共线性 - 自变量之间的高相关性
No Heteroscedasticity: 没有异方差性

Hetroscedasticity （异方差性）: 指的是残差随着自变量的变化而变化，也就是残差的方差不是常数

Fixing broken assumptions （解决假设不成立的情况）

Outliers: filter掉
Multicollinearity: 从高相关的自变量中选择一个作为代表
Heteroscedasticity: 使用log或sqrt等方法进行变换，转化数据以减少变化的不一致性

特征选择

Statistical Test of Significance （统计显著性检验）

Hypothesis

特征的真实参数值为0 - F-test
特征对模型不是必要的 - T-test

Test Method

T-test: 关注单个特征的重要性
F-test: 关注多个特征组合的重要性 OR 在特征较多时，关注某个特征对模型的贡献

不理解就如图所示：
T-test & F-test
T-test里面就是拆解开，color[red], color[rose], color[white]这几个参数各自对模型的重要性；
而F-test里面就是c这一个参数，也就是所有的颜色的集合，这个属性，对模型的重要性。

Why Feature Selection?

Feature Variable ↑ =>
- computational Cost ↑
- samples required to fit the model ↑
- data sparsity ↑
Data expensive to collect & store
Not all features are useful
维度太高没法可视化

Feature Selection Methods

Filter Methods

Unsupservised: 通过特征之间的关系来选择特征 -- e.g. 如果两个特征高度相关，选一个做代表
Supervised: 通过特征与目标变量之间的关系来选择特征 -- e.g. 删掉相关性低的

Wrapper Methods

Embedded Methods

Use models that have built-in feature selection:

Lasso Regression
Random Forest / Decision Trees

Feature extraction

The building of derived features, or the inference of hidden features, from an initial dataset（从原始数据集中构建派生特征，或从中推断隐藏特征）

Neural Networks
Projection Methods: PCA, SVD...

先来复习点基础概念​

线性回归​

线性回归的假设​

Fixing broken assumptions （解决假设不成立的情况）​

特征选择​

Statistical Test of Significance （统计显著性检验）​

Hypothesis​

Test Method​

Why Feature Selection?​

Feature Selection Methods​

Filter Methods​

Wrapper Methods​

Embedded Methods​

Feature extraction​